Fastest comet ever recorded spewed 70 Olympic pools' worth of water daily

Popular Science

A new study of the interstellar comet 3I/ATLAS, led by the University of Michigan, shows that its water has a remarkably high deuterium content. This form of hydrogen is comparatively less abundant in our solar system, enabling researchers to glean new insights about planetary processes at work elsewhere in our galaxy. Astronomers knew 3I/ATLAS wasn't a local comet not long after first spotting it in July 2025.


Two mountain lion cubs rescued from certain death

Popular Science

Crimson and Clover are now on the road to recovery at Oakland Zoo in California. Crimson (left) was rescued shortly after Clover (right). Mountain lions (also known as cougars or pumas, among their many other names) are carnivorous, sharp-toothed and sharp-clawed big cats.


MinShap: A Modified Shapley Value Approach for Feature Selection

Zheng, Chenghui, Raskutti, Garvesh

arXiv.org Machine Learning

Feature selection is a classical problem in statistics and machine learning, and it remains extremely challenging, especially in the context of unknown non-linear relationships with dependent features. On the other hand, Shapley values are a classic solution concept from cooperative game theory that is widely used for feature attribution in general non-linear models with highly dependent features. However, Shapley values are not naturally suited for feature selection, since they tend to capture both direct effects from each feature to the response and indirect effects through other features. In this paper, we combine the advantages of Shapley values and adapt them to feature selection by proposing \emph{MinShap}, a modification of the Shapley value framework, along with a suite of other related algorithms. In particular, MinShap, instead of taking the average marginal contribution over permutations of features, considers the minimum marginal contribution across permutations. We provide a theoretical foundation motivated by the faithfulness assumption in DAGs (directed acyclic graphical models), a guarantee on the Type I error of MinShap, and show through numerical simulations and real-data experiments that MinShap tends to outperform state-of-the-art feature selection algorithms such as LOCO, GCM and Lasso in terms of both accuracy and stability. We also introduce a suite of algorithms related to MinShap from the multiple testing/p-value perspective that improves performance in low-sample settings, and provide supporting theoretical guarantees.
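The min-versus-average distinction can be seen in a toy coalition game. The value function below is entirely hypothetical (a stand-in for predictive performance on a feature subset, not the paper's estimator): x1 carries a direct signal, x3 is only an imperfect proxy for x1, and x2 is independent.

```python
import itertools

FEATURES = ["x1", "x2", "x3"]

def value(subset):
    # Hypothetical coalition value: a stand-in for model performance on
    # a feature subset. x1 has a direct effect, x3 is only an imperfect
    # proxy for x1 (an indirect path), and x2 is independent.
    s = set(subset)
    v = 0.0
    if "x1" in s:
        v += 3.0
    elif "x3" in s:
        v += 2.4
    if "x2" in s:
        v += 2.0
    return v

def contributions(feature):
    # Marginal contribution of `feature` under every feature ordering.
    out = []
    for perm in itertools.permutations(FEATURES):
        before = perm[:perm.index(feature)]
        out.append(value(before + (feature,)) - value(before))
    return out

for f in FEATURES:
    c = contributions(f)
    # Classic Shapley averages the contributions; MinShap takes the minimum.
    print(f, "shapley=", round(sum(c) / len(c), 2), "minshap=", round(min(c), 2))
```

Averaging still credits the purely indirect x3, while the minimum over permutations drives its score to zero and keeps x1 and x2 positive, which is the selection signal MinShap exploits.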


Inside the UFO hotel in Wales - with 'spacecraft' door, NASA-designed interiors and Doctor Who TARDIS bathroom

Daily Mail - Science & tech

Ready to hit the mute button on reality? Deep in the Pembrokeshire countryside lies a cosmic retreat that feels almost light years away from Earth. The awe-inspiring Spodnic UFO is one of three standout stays at Melin Mabes, a four-acre glamping site owned and run by Martin Johnson and his wife, CarolAnne.
'It looks like it's just landed from outer space and aliens could come out,' Martin notes as he showcases his brainchild during the first episode of Channel 4's World's Most Secret Hotels.


Cram Less to Fit More: Training Data Pruning Improves Memorization of Facts

Ye, Jiayuan, Feldman, Vitaly, Talwar, Kunal

arXiv.org Machine Learning

Large language models (LLMs) can struggle to memorize factual knowledge in their parameters, often leading to hallucinations and poor performance on knowledge-intensive tasks. In this paper, we formalize fact memorization from an information-theoretic perspective and study how training data distributions affect fact accuracy. We show that fact accuracy is suboptimal (below the capacity limit) whenever the amount of information contained in the training data facts exceeds model capacity. This is further exacerbated when the fact frequency distribution is skewed (e.g. a power law). We propose data selection schemes based on the training loss alone that aim to limit the number of facts in the training data and flatten their frequency distribution. On semi-synthetic datasets containing high-entropy facts, our selection method effectively boosts fact accuracy to the capacity limit. When pretraining language models from scratch on an annotated Wikipedia corpus, our selection method enables a GPT2-Small model (110M parameters) to memorize 1.3X more entity facts compared to standard training, matching the performance of a 10X larger model (1.3B parameters) pretrained on the full dataset.
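The flattening idea can be sketched with explicit fact counts. Note the hedge: the paper selects using training loss alone, without fact annotations, so the frequency cap below is only an illustrative proxy for its effect on a skewed stream.

```python
import collections
import random

random.seed(0)

# Hypothetical training stream: fact ids drawn from a heavy-tailed
# (power-law-like) distribution, so a few facts dominate the data.
facts = [int(random.paretovariate(1.0)) for _ in range(10_000)]

def flatten(stream, cap):
    # Keep at most `cap` occurrences of each fact. This flattens the
    # frequency distribution directly; the paper achieves a similar
    # effect using only the training loss, with no fact labels.
    seen = collections.Counter()
    kept = []
    for fact in stream:
        if seen[fact] < cap:
            seen[fact] += 1
            kept.append(fact)
    return kept

pruned = flatten(facts, cap=3)
print("examples kept:", len(pruned), "of", len(facts))
print("max fact frequency:", max(collections.Counter(facts).values()),
      "->", max(collections.Counter(pruned).values()))
```

Every distinct fact survives (its first occurrences are always kept), while the head of the distribution is truncated, freeing capacity for the tail.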


A unifying view of contrastive learning, importance sampling, and bridge sampling for energy-based models

Martino, Luca

arXiv.org Machine Learning

In recent decades, energy-based models (EBMs) have become an important class of probabilistic models in which a component of the likelihood is intractable and therefore cannot be evaluated explicitly. Consequently, parameter estimation in EBMs is challenging for conventional inference methods. In this work, we provide a unified framework that connects noise contrastive estimation (NCE), reverse logistic regression (RLR), multiple importance sampling (MIS), and bridge sampling within the context of EBMs. We further show that these methods are equivalent under specific conditions. This unified perspective clarifies relationships among existing methods and enables the development of new estimators, with the potential to improve statistical and computational efficiency. Furthermore, this study helps elucidate the success of NCE in terms of its flexibility and robustness, while also identifying scenarios in which its performance can be further improved. Hence, rather than being a purely descriptive review, this work offers a unifying perspective and additional methodological contributions. The MATLAB code used in the numerical experiments is also made freely available to support the reproducibility of the results.
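The intractable component the abstract refers to is the normalizing constant. A minimal importance-sampling estimate of it, for a toy 1-D energy whose normalizer is actually known (this is a generic illustration, not the paper's MATLAB code):

```python
import math
import random

random.seed(0)

def energy(x):
    # Toy 1-D energy: exp(-x**2 / 2) integrates to sqrt(2*pi), a
    # normalizer we pretend is intractable.
    return 0.5 * x * x

a, b, n = -6.0, 6.0, 200_000
q = 1.0 / (b - a)  # density of the uniform proposal on [a, b]

# Importance sampling: Z = E_q[ exp(-E(x)) / q(x) ] for x ~ q.
z_hat = sum(math.exp(-energy(random.uniform(a, b))) / q for _ in range(n)) / n
print(z_hat, "vs", math.sqrt(2.0 * math.pi))
```

NCE replaces this direct ratio estimate with a logistic classification between data and noise samples; the paper's point is that such estimators, along with bridge sampling, sit in one family.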


Cactus: Accelerating Auto-Regressive Decoding with Constrained Acceptance Speculative Sampling

Hao, Yongchang, Mou, Lili

arXiv.org Machine Learning

Speculative sampling (SpS) has been successful in accelerating the decoding throughput of auto-regressive large language models by leveraging smaller draft models. SpS strictly enforces the generated distribution to match that of the verifier LLM. This is unnecessarily restrictive, as slight variations of the verifier's distribution, such as sampling with top-$k$ or temperature, would also be acceptable. Typical acceptance sampling (TAS) alleviates this issue by accepting more tokens using entropy-based heuristics. However, this approach distorts the verifier distribution, potentially degrading output quality when the verifier encodes critical information. In this work, we formalize the speculative sampling algorithm through the lens of constrained optimization. Based on this formulation, we propose Cactus (constrained acceptance speculative sampling), a method that guarantees a controlled divergence from the verifier distribution while increasing acceptance rates. Empirical results across a wide range of benchmarks confirm the effectiveness of our approach.
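For context, the baseline that Cactus relaxes is standard speculative sampling, whose acceptance rule keeps the verifier distribution exact. A minimal sketch over a hypothetical three-token vocabulary (this is the classic accept/resample rule, not the Cactus algorithm itself):

```python
import random

random.seed(0)

def speculative_accept(draft_probs, target_probs, token):
    # Standard speculative sampling: accept the draft token with
    # probability min(1, p_target / p_draft); on rejection, resample
    # from the normalized residual max(p_target - p_draft, 0). The
    # output is then distributed exactly as target_probs.
    p, q = target_probs[token], draft_probs[token]
    if random.random() < min(1.0, p / q):
        return token
    residual = [max(tp - dp, 0.0) for tp, dp in zip(target_probs, draft_probs)]
    r = random.random() * sum(residual)
    acc = 0.0
    for i, w in enumerate(residual):
        acc += w
        if r <= acc:
            return i
    return len(residual) - 1

# Hypothetical draft and verifier distributions over 3 tokens.
draft = [0.5, 0.3, 0.2]
target = [0.2, 0.5, 0.3]
counts = [0, 0, 0]
for _ in range(100_000):
    token = random.choices(range(3), weights=draft)[0]
    counts[speculative_accept(draft, target, token)] += 1
freqs = [c / 100_000 for c in counts]
print(freqs)  # approaches `target` as the sample size grows
```

Cactus keeps this structure but replaces the exact-match constraint with a bounded divergence from the verifier, trading a controlled distortion for a higher acceptance rate.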


Do covariates explain why these groups differ? The choice of reference group can reverse conclusions in the Oaxaca-Blinder decomposition

Quintero, Manuel, Shreekumar, Advik, Stephenson, William T., Broderick, Tamara

arXiv.org Machine Learning

Scientists often want to explain why an outcome is different in two groups. For instance, differences in patient mortality rates across two hospitals could be due to differences in the patients themselves (covariates) or differences in medical care (outcomes given covariates). The Oaxaca-Blinder decomposition (OBD) is a standard tool to tease apart these factors. It is well known that the OBD requires choosing one of the groups as a reference, and the numerical answer can vary with the reference. To the best of our knowledge, there has not been a systematic investigation into whether the choice of OBD reference can yield different substantive conclusions and how common this issue is. In the present paper, we give existence proofs in real and simulated data that the OBD references can yield substantively different conclusions and that these differences are not entirely driven by model misspecification or small data. We prove that substantively different conclusions occur in up to half of the parameter space, but find these discrepancies rare in the real-data analyses we study. We explain this empirical rarity by examining how realistic data-generating processes can be biased towards parameters that do not change conclusions under the OBD.
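The reference dependence is easy to reproduce with two hypothetical linear models (all numbers below are invented for illustration):

```python
# Hypothetical group means and fitted linear models (intercept, slope).
xbar_A, xbar_B = 1.0, 1.4
beta_A = (0.0, 2.0)    # group A: y = 0.0 + 2.0 * x
beta_B = (1.0, -0.5)   # group B: y = 1.0 - 0.5 * x

ybar_A = beta_A[0] + beta_A[1] * xbar_A   # 2.0
ybar_B = beta_B[0] + beta_B[1] * xbar_B   # 0.3
gap = ybar_A - ybar_B                     # 1.7

def obd(reference):
    # Oaxaca-Blinder: the gap splits into an "explained" part, which
    # prices the covariate difference at the reference group's slope,
    # and an "unexplained" (coefficient) remainder.
    slope = beta_B[1] if reference == "B" else beta_A[1]
    explained = (xbar_A - xbar_B) * slope
    return explained, gap - explained

print("reference B:", obd("B"))  # explained part is positive
print("reference A:", obd("A"))  # explained part is negative
```

The explained share flips sign (+0.2 versus -0.8 of a 1.7 gap): one reference says covariates widen the gap, the other says they narrow it, which is exactly the reversal the paper studies.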


Enhancing Online Support Group Formation Using Topic Modeling Techniques

Barman, Pronob Kumar, Reynolds, Tera L., Foulds, James

arXiv.org Machine Learning

Online health communities (OHCs) are vital for fostering peer support and improving health outcomes. Support groups within these platforms can provide more personalized and cohesive peer support, yet traditional support group formation methods face challenges related to scalability, static categorization, and insufficient personalization. To overcome these limitations, we propose two novel machine learning models for automated support group formation: the Group-specific Dirichlet Multinomial Regression (gDMR) and the Group-specific Structured Topic Model (gSTM). These models integrate user-generated textual content, demographic profiles, and interaction data represented through node embeddings derived from user networks to systematically automate personalized, semantically coherent support group formation. We evaluate the models on a large-scale dataset from MedHelp, comprising over 2 million user posts. Both models substantially outperform baseline methods including LDA, DMR, and STM in predictive accuracy (held-out log-likelihood), semantic coherence (UMass metric), and internal group consistency. The gDMR model yields group covariates that facilitate practical implementation by leveraging relational patterns from network structures and demographic data. In contrast, gSTM emphasizes sparsity constraints to generate more distinct and thematically specific groups. Qualitative analysis further validates the alignment between model-generated groups and manually coded themes, showing the practical relevance of the models in informing groups that address diverse health concerns such as chronic illness management, diagnostic uncertainty, and mental health. By reducing reliance on manual curation, these frameworks provide scalable solutions that enhance peer interactions within OHCs, with implications for patient engagement, community resilience, and health outcomes.


Minimax Generalized Cross-Entropy

Bondugula, Kartheek, Mazuelas, Santiago, Pérez, Aritz, Liu, Anqi

arXiv.org Machine Learning

Loss functions play a central role in supervised classification. Cross-entropy (CE) is widely used, whereas the mean absolute error (MAE) loss can offer robustness but is difficult to optimize. Interpolating between the CE and MAE losses, generalized cross-entropy (GCE) has recently been introduced to provide a trade-off between optimization difficulty and robustness. Existing formulations of GCE result in a non-convex optimization over classification margins that is prone to underfitting, leading to poor performance with complex datasets. In this paper, we propose a minimax formulation of generalized cross-entropy (MGCE) that results in a convex optimization over classification margins. Moreover, we show that MGCE can provide an upper bound on the classification error. The proposed bilevel convex optimization can be efficiently implemented using stochastic gradients computed via implicit differentiation. Using benchmark datasets, we show that MGCE achieves strong accuracy, faster convergence, and better calibration, especially in the presence of label noise.
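The CE-MAE interpolation the abstract refers to can be checked numerically. This sketches the original GCE loss itself, not the paper's minimax (MGCE) formulation:

```python
import math

def gce(p_true, q):
    # Generalized cross-entropy on the probability assigned to the true
    # class: (1 - p**q) / q. As q -> 0 this tends to -log(p), i.e. CE;
    # at q = 1 it equals 1 - p, the MAE against a one-hot label (up to
    # a constant factor).
    return (1.0 - p_true ** q) / q

p = 0.7
print(gce(p, 1e-8), "~", -math.log(p))  # near the CE limit
print(gce(p, 1.0), "=", 1.0 - p)        # the MAE endpoint
```

Intermediate values of q trade the steep gradients of CE (which amplify noisy labels) against the flat, hard-to-optimize MAE endpoint, which is the trade-off MGCE makes convex over margins.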